A deep learning model is a computational graph that tries to map inputs, each drawn from a dataset with common characteristics, to outputs drawn from a related distribution.
It is a graph with many layers (simply nested functions).
That depth is what makes gradient descent a bit tricky here.
Regression vs Classification Neural Networks
Classification
Use a sigmoid (or another squashing nonlinearity) on a single output so the readout behaves like a probability or score — that setup is classification.
Regression
Use a linear readout from the hidden layer, e.g. \(\sum_i w^{(2)}_{i,1}\, H_i\), to predict a real number — that setup is regression.
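As a minimal sketch of the two readouts (the hidden activations \(H\) and second-layer weights below are made-up illustrative values):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical hidden-layer activations and second-layer weights
H = np.array([0.2, 0.7, 0.5])
w2 = np.array([1.5, -0.8, 2.0])

regression_output = w2 @ H               # linear readout: any real number
classification_output = sigmoid(w2 @ H)  # squashed readout: lies in (0, 1)
print(regression_output, classification_output)
```

The only difference between the two setups is whether the final dot product passes through a squashing nonlinearity.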
Binary Classification
Binary Classification Neural Network
(One vs All) Multi-class Classification
Given a classification problem with \(N\) possible classes, a one-vs.-all solution consists of \(N\) separate binary classifiers, one per class.
Note. The output vector has to be encoded using one-hot encoding.
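One-hot encoding can be sketched in a few lines of NumPy (the labels and class count below are illustrative):

```python
import numpy as np

def one_hot(labels, num_classes):
    """Encode integer class labels as one-hot row vectors."""
    encoded = np.zeros((len(labels), num_classes))
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

# Three samples from a 4-class problem: classes 2, 0, and 3
print(one_hot([2, 0, 3], 4))
```

Each row has a single 1 in the column of the true class, matching the per-class binary targets that one-vs.-all training needs.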
Training a Neural Network
L is the loss function.
Y is the target value.
\(L(Y, O)\) is the loss between the target value and the output.
We need to find each \(\frac{\partial L}{\partial w^k_{ij}}\)
The calculus chain rule is needed to find these derivatives.
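As a toy illustration of the chain rule at work, here is a one-weight "network" with a sigmoid output and squared-error loss; the analytic gradient is checked against a central finite difference (all values here are made up):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy one-weight "network": o = sigmoid(w * x), loss L = (y - o)^2
x, y, w = 2.0, 1.0, 0.5
o = sigmoid(w * x)

# Chain rule: dL/dw = dL/do * do/d(wx) * d(wx)/dw = -2(y - o) * o(1 - o) * x
grad = -2 * (y - o) * o * (1 - o) * x

# Central finite-difference check of the same derivative
h = 1e-6
L = lambda w_: (y - sigmoid(w_ * x)) ** 2
numeric = (L(w + h) - L(w - h)) / (2 * h)
print(grad, numeric)
```

Backpropagation applies this same decomposition layer by layer, reusing the intermediate factors for every \(w^k_{ij}\).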
TABLE OF CONTENTS
1. Neural Networks for Classification/Regression✓
2. The Softmax Cross Entropy Loss Function●
3. Multi-class network: computational graph○
More Accurate Multi-class Classification
Previously, we used sigmoid outputs with mean squared error (MSE) as the loss function.
Pros
Convex (the gradient grows steeper the farther the output is from the target)
Acceptable performance
Cons
Outputs cannot be interpreted as probabilities.
Does not work with all problem types.
In practice, we use softmax + cross-entropy loss.
Softmax Function
Softmax extends the binary logistic regression idea (probabilities add up to 1) to the multi-class setting.
It helps training converge more quickly than it otherwise would.
Let \(z\) denote the logits (the raw, unnormalized scores output by the network before applying softmax).
Softmax Function
How does it work?
Suppose we have an input vector from the previous layer
[5, 3, 2]
One way to transform these values into a vector of probabilities is to divide each value by their sum.
A better way is to exponentiate each value first and then normalize; this is exactly what softmax does.
Intuition
Softmax Intuition
Softmax more strongly amplifies the maximum value relative to the other values.
Softmax is partway between normalizing the values and actually applying the max function!
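A small NumPy sketch makes the comparison concrete, using the example vector [5, 3, 2] from the previous slide (subtracting the max before exponentiating is a standard numerical-stability trick, not part of the definition):

```python
import numpy as np

def softmax(z):
    """Subtract the max before exponentiating for numerical stability."""
    exps = np.exp(z - np.max(z))
    return exps / exps.sum()

z = np.array([5.0, 3.0, 2.0])
print(z / z.sum())   # plain normalization: [0.5, 0.3, 0.2]
print(softmax(z))    # softmax amplifies the max: roughly [0.84, 0.11, 0.04]
```

Both outputs sum to 1, but softmax pushes most of the probability mass onto the largest input while still leaving the others nonzero.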
Our Neural Network
Neural Network Architecture
Loss Function Diagram View
The incoming vector \(\textbf{x}\) is transformed by the softmax function to produce the vector of probabilities \(\textbf{p}\).
The vector of probabilities \(\textbf{p}\) is then compared to the vector of actual values \(\textbf{y}\) using the cross entropy loss function resulting in a scalar loss value.
Cross Entropy
Recall that a loss function takes in a vector of probabilities and a vector of actual values.
The cross-entropy loss function, for each index \(i\) in these vectors, is:
\(\ell_i = -y_i\log(p_i)-(1-y_i)\log(1-p_i)\)
Intuition
Since \(p_i\) is a probability between 0 and 1, and \(y_i\) is either 0 or 1, the loss simplifies as follows.
if \(y_i = 1\): Loss = \(-\log(p_i)\)
if \(y_i = 0\): Loss = \(-\log(1 - p_i)\)
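A minimal NumPy sketch of this per-index loss (the `eps` clipping is an added guard against \(\log(0)\), not part of the formula):

```python
import numpy as np

def cross_entropy(y, p, eps=1e-12):
    """Per-index cross-entropy; clipping guards against log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

print(cross_entropy(1.0, 0.9))  # small loss: confident and correct
print(cross_entropy(0.0, 0.9))  # large loss: confident and wrong
```

The loss stays small when the predicted probability agrees with the label and blows up toward infinity as the prediction becomes confidently wrong.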
Cross Entropy Loss Intuition
Gradient Computation
The real magic happens when we combine this loss with the softmax function.
Recognize softmax probability definition \(p_1 = \frac{e^{x_1}}{e^{x_1}+e^{x_2}+e^{x_3}}\)
\(\frac{\partial \ell_1}{\partial x_1} = p_1-y_1\) (the derivative of the loss \(\ell_1\) with respect to the logit \(x_1\))
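We can sanity-check the \(p - y\) formula numerically. The sketch below uses the multi-class form \(-\sum_i y_i \log(p_i)\) with a one-hot target and compares the analytic gradient against central finite differences (the specific logits are illustrative):

```python
import numpy as np

def softmax(z):
    exps = np.exp(z - np.max(z))
    return exps / exps.sum()

def loss(z, y):
    """Cross-entropy of softmax(z) against a one-hot target y."""
    return -np.sum(y * np.log(softmax(z)))

z = np.array([5.0, 3.0, 2.0])   # illustrative logits
y = np.array([1.0, 0.0, 0.0])   # one-hot target

analytic = softmax(z) - y       # the p - y formula

# Central finite differences on each logit
h = 1e-5
numeric = np.zeros_like(z)
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += h
    zm[i] -= h
    numeric[i] = (loss(zp, y) - loss(zm, y)) / (2 * h)

print(np.max(np.abs(analytic - numeric)))  # agreement to several decimals
```

This simple gradient is exactly why softmax and cross-entropy are paired in practice: the awkward derivatives of each cancel when composed.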
TABLE OF CONTENTS
1. Neural Networks for Classification/Regression✓
2. The Softmax Cross Entropy Loss Function✓
3. Multi-class network: computational graph●
Neural Network As Computational Graph
Two-layer network for multi-class classification (softmax output).
Architecture: one hidden layer (here with sigmoid activations) and one output layer whose activations are the logits \(N\); applying softmax yields class probabilities \(P\) (e.g. dog / cat / horse, or a 3-way softmax in the diagram).
Task: predict a probability vector over classes — this is multi-class classification, not binary.
Neural Network As Computational Graph
Neural Network Computational Graph
\(\textbf{X}\) is a 4x3 matrix (4 samples, 3 features)
Compute the gradient of \(L\) (cross-entropy loss) with respect to \(W^{(2)}_{3,3}\) (backward path: \(L \leftarrow N \leftarrow W^{(2)}\); \(N\) are logits before softmax).
Compute the gradient of \(L\) with respect to \(W^{(1)}_{3,3}\) (full backward path through hidden sigmoid: \(L \leftarrow N \leftarrow M \leftarrow V \leftarrow U \leftarrow W^{(1)}\)).
Note. \(\odot\) gives \(4\times 3\) (one row per sample). \(B^{(2)}_{1,3}\) is shared across rows — sum across rows (column-wise over the batch) to get \(\partial L/\partial B^{(2)}_{1,3}\).
Note. \(\odot\) gives \(4\times 3\) (one row per sample). \(B^{(1)}_{1,3}\) is shared across rows — sum across rows (column-wise over the batch) to get \(\partial L/\partial B^{(1)}_{1,3}\).
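The row-summing for a shared bias can be sketched as follows (the upstream gradient values below are made up; only the shapes match the slides' \(4\times 3\) batch example):

```python
import numpy as np

# Made-up upstream gradient dL/dN for a batch of 4 samples x 3 logits
dN = np.array([[ 0.1, -0.2,  0.1],
               [ 0.0,  0.3, -0.3],
               [-0.1,  0.0,  0.1],
               [ 0.2, -0.1, -0.1]])

# The 1x3 bias is broadcast across the 4 rows in the forward pass,
# so its gradient sums the per-sample contributions column-wise
dB2 = dN.sum(axis=0, keepdims=True)
print(dB2.shape)  # (1, 3): one gradient entry per bias component
```

Summing over the batch axis is the general rule: whenever a parameter is broadcast in the forward pass, its gradient accumulates over the broadcast dimension.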
Thank You!
document.addEventListener("DOMContentLoaded", function() {
function handleQuizClick(e) {
// Find closest list item
const li = e.target.closest('li');
if (!li) return;
// Find the checkbox strictly within this LI (not nested deeper)
// We can check if the checkbox's closest LI is indeed this LI.
const checkbox = li.querySelector('input[type="checkbox"]');
if (!checkbox) return;
// Verify strict parent-child relationship to avoid triggering when clicking parent Question LI
if (checkbox.closest('li') !== li) return;
// We found a quiz item: stop the event here so parent handlers don't fire too.
e.stopPropagation();
// Determine correctness. Quarto/Pandoc sets the 'checked' attribute for [x]
// items in the HTML source, so read the attribute (the initial state) rather
// than the 'checked' property, which the click may already have toggled.
const isCorrect = checkbox.hasAttribute('checked');
// We only want feedback colors; the checkbox state itself is left untouched.
// Reset classes
li.classList.remove('quiz-correct', 'quiz-incorrect');
// Apply feedback
if (isCorrect) {
li.classList.add('quiz-correct');
} else {
li.classList.add('quiz-incorrect');
}
}
function initQuiz() {
// Enable checkboxes and style them
const checkboxes = document.querySelectorAll(".reveal .slides li input[type='checkbox']");
checkboxes.forEach(cb => {
cb.disabled = false;
// Prevent the browser's default toggling on the input; our click handler supplies the feedback
cb.onclick = function(e) { e.preventDefault(); };
const li = cb.closest('li');
if (li) {
li.classList.add('quiz-option');
// Direct listener on LI is sometimes more reliable than delegation in complex frameworks
li.removeEventListener('click', handleQuizClick);
li.addEventListener('click', handleQuizClick);
}
});
}
// Initialize on Reveal ready
if (window.Reveal) {
if (window.Reveal.isReady()) {
initQuiz();
} else {
window.Reveal.on('ready', initQuiz);
}
window.Reveal.on('slidechanged', initQuiz);
// Also on fragment shown, as content might appear
window.Reveal.on('fragmentshown', initQuiz);
} else {
initQuiz();
}
});